I've been using AWS Rekognition to detect text written in images. AWS Rekognition charges every time I ask it to determine the text contained in an image, even if the image doesn't contain any text. If there were a way to determine whether an image contains text before calling AWS Rekognition, I could reduce my costs.
When I started I didn't have any existing data or information about how many of the images contained text that could be recognized. I decided to call Rekognition on every image for the first month to create some data. In that month I gathered results on 696,790 unique images. Unique in this context means that the SHA256 and ResNet50 feature vectors differed.
AWS Rekognition provides a number of APIs, such as object detection and face detection, but the function I'm using is DetectText. This API method recognizes text inside images larger than 80x80 pixels. DetectText returns a list of detected words and lines of words, along with a confidence score for each detection. This will be useful, as I only want to consider highly confident and therefore accurate detections.
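For reference, a DetectText call (via boto3's `detect_text`) returns a response shaped like the trimmed sample below. The `confident_words` helper and the 99% cutoff are my own illustration, not part of the AWS API:

```python
# Keep only high-confidence WORD detections from a DetectText response.
# The response shape follows the AWS Rekognition API; the helper and the
# 99% cutoff are illustrative choices, not part of the service itself.

def confident_words(response, min_confidence=99.0):
    return [d['DetectedText']
            for d in response.get('TextDetections', [])
            if d['Type'] == 'WORD' and d['Confidence'] >= min_confidence]

# A trimmed example response of the kind DetectText returns:
sample_response = {
    'TextDetections': [
        {'DetectedText': 'OPEN', 'Type': 'WORD', 'Confidence': 99.7},
        {'DetectedText': 'OPEN 24H', 'Type': 'LINE', 'Confidence': 99.1},
        {'DetectedText': '24H', 'Type': 'WORD', 'Confidence': 87.2},
    ]
}

print(confident_words(sample_response))  # ['OPEN']
```

In practice the response would come from `boto3.client('rekognition').detect_text(Image={'Bytes': image_bytes})`.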
There are a number of additional statistics that can be calculated from the results of AWS Rekognition. The best way to view them is with histograms from pandas, but first let's get some basic statistics.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
import os
import glob
import random
import matplotlib.image as mpimg
from matplotlib import colors
# All of the stats are stored in a file extracted from other data.
detection_results = pd.read_csv('image-stats.csv', index_col=False)
detection_results.describe()
There are outliers in the width and height columns that should be dealt with, as they will skew some of the output histograms. Images that are 696,790 pixels wide aren't likely to be very relevant.
Let's look at the histograms after filtering to images with a width < 1000 and a height < 1000 pixels.
detection_results[(detection_results.width < 1000) &
                  (detection_results.height < 1000)].hist(figsize=(20, 20), bins=25)
A few interesting observations can be made about this data.
Bounding boxes can actually be larger than the source image: if Rekognition detects text, it may place the bounding box partly outside the image dimensions. I believe this helps with truncated letters.
There is supposedly a 50-word limit for DetectText, but that limit appears to have been exceeded.
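Rekognition reports each detection's `Geometry.BoundingBox` as `Left`, `Top`, `Width` and `Height` ratios of the image dimensions, which is how a box can end up outside the image: the ratios can be negative or extend past 1.0. A small helper (my own sketch, not part of the API) converts a ratio box to pixel coordinates clamped to the image:

```python
def bbox_to_pixels(box, img_width, img_height):
    """Convert a Rekognition ratio BoundingBox to pixel coordinates,
    clamped to the image. The clamping is my own addition; Rekognition
    itself reports boxes that extend past the image edges."""
    left = box['Left'] * img_width
    top = box['Top'] * img_height
    right = left + box['Width'] * img_width
    bottom = top + box['Height'] * img_height
    # Clamp a coordinate into [0, hi] after rounding to a whole pixel.
    clamp = lambda v, hi: max(0, min(round(v), hi))
    return (clamp(left, img_width), clamp(top, img_height),
            clamp(right, img_width), clamp(bottom, img_height))

# A box that pokes past the right edge of a 640x480 image:
print(bbox_to_pixels({'Left': 0.9, 'Top': 0.5, 'Width': 0.2, 'Height': 0.1},
                     640, 480))  # (576, 240, 640, 288)
```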
# Break down the images into two classes.
# I'm interested in images that contain more than one word, have more than
# five characters in total, and have a highest confidence greater than 99%.
#
# You may have different desired thresholds.
#
has_text = detection_results[(
    (detection_results.totalWords > 1) &
    (detection_results.totalCharacters > 5) &
    (detection_results.highestConfidence >= 99))]
# missing_text are images that don't pass the classification criteria.
# In actuality these images may still contain text, but with only one
# word, five or fewer characters, or a confidence value that isn't
# high enough.
missing_text = detection_results[~(
    (detection_results.totalWords > 1) &
    (detection_results.totalCharacters > 5) &
    (detection_results.highestConfidence >= 99))]
# Show the statistics for images that contain text.
has_text.describe()
has_text.hist(figsize=(20,20), bins=50)
missing_text.describe()
# Load and display some images with associated data.
def show_images(rows):
    images = []
    for index, img_path_record in rows.iterrows():
        images.append([mpimg.imread(img_path_record['img_filename']),
                       img_path_record])
    plt.figure(figsize=(20, 40))
    plt.axis('off')
    columns = 4
    for i, record in enumerate(images):
        image = record[0]
        info = record[1]
        # Integer division: plt.subplot requires an integer row count.
        plt.subplot(len(images) // columns + 1, columns, i + 1)
        plt.axis('off')
        plt.title("Confidence: {}, Words: {}, Chars: {}".format(
            round(info['highestConfidence'], 2),
            info['totalWords'],
            info['totalCharacters']))
        plt.imshow(image)
show_images(has_text.sample(frac=1)[0:50])
show_images(missing_text.sample(frac=1)[0:50])
The images have now been split into two classes: one that should be sent to Rekognition and another that shouldn't. The classes are roughly equal in size, so there shouldn't be much of a class imbalance problem.
369,610 images pass the classifier and 327,180 do not: ~53% pass and ~47% don't.
To create the classifier I tried a few machine learning techniques from the laziest to the least lazy:
Transfer learning. Use the lower layers of an existing convolutional neural network (CNN) (i.e. ResNet50) and add additional layers. This yielded an accuracy of 86% on a smaller extracted test dataset.
Transfer learning with a gradient boosted tree. Convert the images to feature vectors using an existing CNN (i.e. ResNet50), then use xgboost to create a gradient boosted tree. This is a bit like transfer learning and trees combined. It yielded an accuracy of 88% on a smaller extracted test dataset.
Train a convolutional neural network specifically for this task. This approach was the most successful and will be explained in the rest of this document.
For creating the classifier I will be using Tensorflow 2.0. Let's start by importing everything that will be needed.
from __future__ import absolute_import, division, print_function, unicode_literals
from tensorflow.keras import datasets, layers, models
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle
import pathlib
import math
import datetime
import os
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import precision_recall_curve
sns.set(font_scale=2)
AUTOTUNE = tf.data.experimental.AUTOTUNE
tf.__version__
%matplotlib inline
All of the images are located in two directories, ./image-classifier-dataset/true and ./image-classifier-dataset/false respectively based on if the image passed the classification criteria.
I've chosen to split the data into training, test and validation sets.
The test set will be 10% of the data with validation being 1%.
This is a pretty liberal allocation but since I've trained the model multiple times I know that I'd rather use most of my data in the training set rather than the test set.
# The images have been saved to a directory structure that looks like
# image-classifier-dataset/
# true/ - Images that should pass the classifier
# false/ - Images that should not pass the classifier
root = pathlib.Path('./image-classifier-dataset/')
# Use sorted() here: list.sort() sorts in place and returns None.
images = sorted(str(path) for path in root.glob('*/*'))
# Create the integer labels for each image by looking at the parent directory
# name of the image.
#
# 0 will represent that the image should not pass the classifier
# 1 will represent that the image should pass the classifier
labels = [0 if pathlib.Path(path).parent.name == 'false' else 1
          for path in images]
total_images = len(images)
# Determine the size of the test set
test_size = math.floor(total_images*0.10)
# Determine the size of the validation set.
validation_size = math.floor(total_images*0.01)
# For reproducible results.
RANDOM_SEED = 2019
X_training, X_test, Y_training, Y_test = train_test_split(
    images,
    labels,
    test_size=test_size,
    random_state=RANDOM_SEED)
X_training, X_validation, Y_training, Y_validation = train_test_split(
    X_training,
    Y_training,
    test_size=validation_size,
    random_state=RANDOM_SEED)
print("Training size: {} Test size: {} Validation size: {}".format(
    len(X_training),
    len(X_test),
    len(X_validation)))
# Save the contents of the test, training and validation sets.
pickle.dump(X_training, open("training_image_paths.p", "wb"))
pickle.dump(X_test, open("test_image_paths.p", "wb"))
pickle.dump(X_validation, open("validation_image_paths.p", "wb"))
The images themselves have already been scaled to 224 by 224 pixels. This dimension was convenient since it is the input dimensions of the ResNet50 pretrained CNN which was used for the initial experiments with transfer learning.
Once an image has been loaded, the pixel values are mapped from the 0-255 range to zero-centered floats ranging from -1 to 1. This is a very standard preprocessing step when using Tensorflow with image data.
The Tensorflow documentation recommends that image pre-processing be performed once and cached to speed up the image pipeline. I've found that the disk size needed to serialize all of the preprocessed images as zero centered floats isn't reasonable.
The hardware that I will be using to train this neural network offers multiple CPU cores. In my experience, Tensorflow can adequately keep the GPU saturated with data while performing the work of loading and decoding the JPEG images and converting the pixel intensities to zero-centered float values.
# Load the image, resize if necessary, then zero center and norm between -1 and 1.
# this is pretty standard for neural networks.
def load_and_preprocess_image(path):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    image /= 127.5
    image -= 1.
    return image
To perform training we need to create a tf.Dataset that combines each image with its expected label (the classification value: 1 for true, 0 for false).
Additionally, for training and validation the tf.Datasets should form batches of examples. Batches are fixed-size collections of images and labels over which the gradient is computed before the network's weights are adjusted. For now I've chosen a batch size of 128 images, meaning the gradient for learning will be averaged over the results of 128 images before the weights are adjusted. Choosing a smaller batch size may allow more accuracy but does increase the necessary training time.
# Determine the batch size for training. How many images will be considered
# before adjusting the weights in the direction of the gradient.
BATCH_SIZE = 128
def createDatasets(X, Y):
    ds = tf.data.Dataset.from_tensor_slices(X)
    X_ds = ds.map(
        load_and_preprocess_image,
        num_parallel_calls=AUTOTUNE)
    # Cast Y (the argument), not Y_test, or every dataset would
    # silently reuse the test labels.
    Y_ds = tf.data.Dataset.from_tensor_slices(
        tf.cast(Y, tf.int8))
    return X_ds, Y_ds
X_test_ds, Y_test_ds = createDatasets(X_test, Y_test)
test_ds = tf.data.Dataset.zip((X_test_ds, Y_test_ds))
X_training_ds, Y_training_ds = createDatasets(X_training, Y_training)
# Training neural nets benefit from being presented random batches, but shuffling
# all of the data could be very expensive in the aspect of time, so just shuffle
# 128 times the batch size.
training_ds = (tf.data.Dataset.zip((X_training_ds, Y_training_ds))
               .shuffle(
                   buffer_size=BATCH_SIZE*128,
                   reshuffle_each_iteration=True)
               .repeat()
               .batch(BATCH_SIZE)
               .prefetch(buffer_size=AUTOTUNE))
X_validation_ds, Y_validation_ds = createDatasets(X_validation, Y_validation)
validation_ds = tf.data.Dataset.zip((X_validation_ds, Y_validation_ds))
validation_batches = validation_ds.batch(BATCH_SIZE)
test_batches = (test_ds.batch(BATCH_SIZE)
                .prefetch(buffer_size=AUTOTUNE))
Deciding on the neural network architecture is a problem that doesn't have a clear path from start to finish. Choices need to be made about the number of layers, the type of layers, the parameters of those layers, the activation functions, and the choice of optimizer, which determines the learning rate schedule.
If the neural network has too many parameters it will overfit the training set; if it has too few it will not be able to learn the general patterns effectively.
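One way to keep the parameter count in view is to compute it per layer. For a Conv2D layer the count is (kernel_h × kernel_w × in_channels + 1) × filters, where the +1 is the bias term. A quick sanity check against the first two convolutional layers used below; the numbers should match the corresponding rows of model.summary():

```python
def conv2d_params(kernel, in_channels, filters):
    # (kernel_h * kernel_w * in_channels + 1) * filters; +1 is the bias.
    return (kernel * kernel * in_channels + 1) * filters

print(conv2d_params(3, 3, 24))   # 672   -> Conv2D(24, 3) on RGB input
print(conv2d_params(3, 24, 64))  # 13888 -> Conv2D(64, 3) after it
```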
from tensorflow.keras import datasets, layers, models
model = models.Sequential([
    layers.Conv2D(24, 3, padding='same',
                  activation=layers.ELU(alpha=1.0),
                  input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.BatchNormalization(),
    layers.Conv2D(64, 5, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 7, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.BatchNormalization(),
    layers.Conv2D(64, 5, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.Conv2D(8, 3, padding='same', activation=layers.ELU(alpha=1.0)),
    layers.MaxPooling2D(),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(64, activation=layers.ELU(alpha=1.0)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer="rmsprop",
              loss='binary_crossentropy',
              metrics=["accuracy"])
model.summary()
The model is now defined, so it is time to train the network using the training set of images.
I've chosen to train for only 7 epochs since in previous testing that was sufficient for the level of accuracy I needed.
# X_training holds the training image paths.
steps_per_epoch = math.ceil(len(X_training)/BATCH_SIZE)
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir, histogram_freq=1)
checkpoint_path = "train.weights.{epoch:02d}-{val_loss:.2f}.hdf5"
checkpoint_dir = os.path.dirname(checkpoint_path)
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
model.fit(
    training_ds,
    epochs=7,
    steps_per_epoch=steps_per_epoch,
    validation_data=validation_batches,
    callbacks=[cp_callback, tensorboard_callback],
)
model.save('train.h5')
The model has now been trained on all of the data for 7 epochs, so it is time to evaluate its performance against the set of test images and labels that the neural net was not trained on.
from sklearn.metrics import precision_recall_curve
Y_predictions = model.predict(test_batches)
precision, recall, thresholds = precision_recall_curve(Y_test, Y_predictions)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    """
    Modified from:
    Hands-On Machine Learning with Scikit-Learn
    and TensorFlow; p. 89
    """
    plt.figure(figsize=(5, 5))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.ylim([0.90, 1.0])
    plt.xlim([0.00, 0.85])
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
plot_precision_recall_vs_threshold(precision, recall, thresholds)
The output of the neural network is a floating point value: the probability, ranging from 0 to 1, that the image passes the classification criteria. All of the training and test examples are labeled exactly 0 or 1, representing whether or not the image passes the classification criteria. A threshold value must be chosen to convert the probability into a value of 0 or 1. If the threshold were chosen to be 0.5, any probability generated by the neural network greater than or equal to 0.5 would be considered 1; otherwise the value is 0.
The choice of the threshold value should be driven by the desired performance of the neural network with regard to precision and recall. A high threshold value such as 0.90 would require the neural network to produce a high probability for an image to be classified as passing. Such a high threshold may cause some images that should pass the criteria to be rejected, because the network isn't confident enough in those predictions. This is known as a false negative.
Conversely, if the threshold is too low, images will be indicated as passing the classification criteria when they should not, which in turn means more calls to AWS Rekognition will be made. But it also means there will be fewer false negatives.
The choice of the threshold value is a tradeoff between these opposing situations.
I'd like to minimize false negatives while not sacrificing overall accuracy, so I'd like a recall value >= 0.95. As such I have chosen a threshold value of 0.4325.
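Given the arrays returned by precision_recall_curve above, the threshold can also be picked mechanically: take the largest threshold whose recall still meets the target. The helper and the tiny made-up curve below are my own sketch; the 0.4325 value above was read off the plot.

```python
import numpy as np

def pick_threshold(precisions, recalls, thresholds, min_recall=0.95):
    """Largest decision threshold whose recall is still >= min_recall.
    precision_recall_curve returns one more precision/recall entry than
    thresholds, so align the arrays by dropping the last element."""
    ok = np.where(recalls[:-1] >= min_recall)[0]
    return thresholds[ok[-1]]

# A tiny made-up curve just to show the mechanics:
recalls = np.array([1.0, 0.97, 0.96, 0.90, 0.80, 0.0])
precisions = np.array([0.80, 0.85, 0.90, 0.95, 0.99, 1.0])
thresholds = np.array([0.1, 0.3, 0.45, 0.6, 0.8])
print(pick_threshold(precisions, recalls, thresholds))  # 0.45
```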
from sklearn import metrics
import seaborn as sns
THRESHOLD = 0.4325
LABELS = ['False', 'True']
max_test = Y_test
max_predictions = [1 if pred >= THRESHOLD else 0 for pred in Y_predictions]
confusion_matrix = metrics.confusion_matrix(max_test, max_predictions)
plt.figure(figsize=(5, 5))
sns.heatmap(confusion_matrix,
            xticklabels=LABELS,
            yticklabels=LABELS,
            annot=True,
            fmt="d",
            annot_kws={"size": 20});
plt.title("Confusion matrix", fontsize=20)
plt.ylabel('Actual label', fontsize=20)
plt.xlabel('Predicted label', fontsize=20)
plt.show()
values = confusion_matrix.view()
error_count = values.sum() - np.trace(values)
# Precision and recall for the negative ("no text") class. With sklearn's
# row-actual, column-predicted layout: precision = TN / (TN + FN) uses the
# first column, recall = TN / (TN + FP) uses the first row.
precision = values[0][0]/(values[0][0]+values[1][0])
recall = values[0][0]/(values[0][0]+values[0][1])
print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", 1 - (error_count/len(Y_predictions)))
print("Error Rate:", error_count/len(Y_predictions))
This means that the model will have only about a ~5% false negative rate, meaning that 5% of the time it will fail to send images to Rekognition that it should. On the other hand, it will be correct 91.6% of the time when it does send images to Rekognition.
How would have this performed on the first dataset?
total_images = (len(missing_text) + len(has_text))
true_text_percentage = len(has_text) / (len(missing_text) + len(has_text))
false_text_percentage = len(missing_text) / (len(missing_text) + len(has_text))
# The base percentage of the population, plus the percentage of false positives, subtracting the rate of false negatives.
total_call_percentage = true_text_percentage + (1-precision) - (1 - recall)
predicted_calls = math.ceil(total_call_percentage * total_images)
print("Predicted calls to Rekognition: {}, predicted Rekognition calls saved: {}".format(predicted_calls, total_images-predicted_calls))
print("Savings rate: {}% of unnecessary calls".format(round((total_images-predicted_calls)/total_images*100, 2)))
print("Actual false rate: {}%".format(round(false_text_percentage*100, 2)))
With the model, only about 43.6% of images will result in calls to Rekognition, which is better than having no model and making every call. The test set has a rate of 46.96% of images without text, so this is an acceptable level of performance to me.
Now that the model has been trained and the threshold selected, the model should be saved so that it can be used by Tensorflow Serving to process requests.
The code below will load the model from train.h5 and then save it in a directory called text-detect as version 1 of the model. If there were multiple versions of the same model, you could increment the version number.
import keras.backend.tensorflow_backend as K
K.set_session
model = tf.keras.models.load_model(
    'train.h5', custom_objects={'ELU': tf.keras.layers.ELU})
tf.keras.experimental.export_saved_model(
    model, './text-detect/1', custom_objects={'ELU': tf.keras.layers.ELU})
Some configuration is needed for Tensorflow Serving to work with multiple models. I've added the text detection model so that my model.conf looks like this.
model_config_list: {
  config: {
    name: "resnet50",
    base_path: "/models/resnet50",
    model_platform: "tensorflow"
  },
  config: {
    name: "xception",
    base_path: "/models/xception",
    model_platform: "tensorflow"
  },
  config: {
    name: "text-detect",
    base_path: "/models/text-detect",
    model_platform: "tensorflow"
  }
}
Next start the Tensorflow serving docker container like this.
#!/bin/bash
export MODELDIR=[FILL IN WITH YOUR MODEL DIR]
docker run --rm \
-p 8501:8501 \
-v "$MODELDIR/resnet-classifier:/models/resnet50" \
-v "$MODELDIR/xception-classifier:/models/xception" \
-v "$MODELDIR/text-detect:/models/text-detect" \
-v "$MODELDIR/model.config:/model.config" tensorflow/serving --enable_batching --model_config_file=/model.config
The model now works, but the input is a 224 by 224 pixel matrix with a separate 32-bit float for each pixel's red, green and blue value. This is quite unwieldy, especially since I'm using the HTTP interface to the Tensorflow model server.
To increase efficiency I'd like to convert the model to be part of an estimator that can take a Base64-encoded JPEG image as input. This means that less data will be sent to the model server, since JPEG is a compressed format. To do this I use the following code:
import os
import tensorflow as tf
import keras.backend.tensorflow_backend as K
K.set_session
model = tf.keras.models.load_model(
    'text-detect-golden.h5', custom_objects={'ELU': tf.keras.layers.ELU})
WIDTH = 224
HEIGHT = 224
CHANNELS = 3
def image_preprocessing(image):
    image = tf.expand_dims(image, 0)
    image = tf.image.resize(
        image, [WIDTH, HEIGHT], method=tf.image.ResizeMethod.BILINEAR)
    image = tf.squeeze(image, axis=[0])
    image = tf.cast(image, dtype=tf.float32)
    image = (image - 127.5) / 127.5
    return image
def serving_input_receiver_fn():
    def prepare_image(image_str_tensor):
        image = tf.image.decode_jpeg(image_str_tensor, channels=CHANNELS)
        return image_preprocessing(image)
    # Make sure the conversion from JPEG to float runs on the CPU
    # and not the GPU.
    with tf.device('/cpu:0'):
        input_ph = tf.compat.v1.placeholder(tf.string, shape=[None])
        images_tensor = tf.map_fn(
            prepare_image, input_ph, back_prop=False, dtype=tf.float32)
    return tf.estimator.export.ServingInputReceiver(
        {'conv2d_input': images_tensor},
        {'image': input_ph})
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    custom_objects={'ELU': tf.keras.layers.ELU}
)
estimator.export_saved_model(
    "./text-detect-estimator/1/",
    serving_input_receiver_fn=serving_input_receiver_fn,
)
By replacing the original model with the estimator and serving it using the model server I can now test the results by passing a Base64 encoded JPEG image in JSON via curl.
{
"signature_name": "serving_default",
"instances": [
{ "b64": "/9j/4AAQSkZJRgABAQAAAQABAAD/..." }
]
}
Perform the prediction using:
curl -X POST -d @test.json http://localhost:8501/v1/models/text-detect:predict
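The test.json payload can be generated from a local JPEG with nothing but the standard library. The fake bytes below are a placeholder; in practice, read the bytes from your image file:

```python
import base64
import json

def build_predict_payload(jpeg_bytes):
    # Tensorflow Serving's REST API treats {"b64": ...} values as
    # Base64-encoded binary input.
    encoded = base64.b64encode(jpeg_bytes).decode('ascii')
    return json.dumps({
        'signature_name': 'serving_default',
        'instances': [{'b64': encoded}],
    })

# Placeholder bytes; normally: with open('image.jpg', 'rb') as f: ...
payload = build_predict_payload(b'\xff\xd8\xff\xe0fake-jpeg-data')
print(json.loads(payload)['signature_name'])  # serving_default
```

Write the result to test.json and POST it with the curl command above.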
Making the change to a Keras estimator that decodes the JPEG increased inference performance. Originally the fastest inference took 150ms per image; with the estimator, inference now takes 40ms on the local CPU, using aggressive batching by the model server.
If I were to use a GPU the inference time could be as low as 4ms with aggressive batching of images.
Thanks to Sam Brice for reviewing and sending valuable suggestions.